Exercise 1

About the project

This course is an introduction to R and GitHub.

Here is the link to my IODS page:
link


Exercise 2 Data analysis

Read the dataset from the local folder and analyze the structure of the dataset

setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")  
learning2014 <- read.csv("learning2014.csv", header=TRUE)  
str(learning2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Deep    : int  43 35 42 42 44 57 46 39 52 48 ...
##  $ Stra    : int  27 22 29 25 29 29 18 32 34 28 ...
##  $ Surf    : int  31 38 27 27 34 29 23 34 26 36 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)  
## [1] 166   8

The “learning2014” dataset consists of 166 observations and 8 variables. The variables (gender, Age, Attitude, Deep, Stra, Surf and Points) describe the results of a survey study conducted as an international collaboration project with teachers.

  • gender = F (female) or M (male); stored as a factor with F as level 1 and M as level 2
  • Age = the age of the students
  • Attitude = Global attitude toward statistics, computed as a sum of numerous variables
  • Deep = Deep approach, computed as a sum of numerous variables
  • Stra = Strategic approach, computed as a sum of numerous variables
  • Surf = Surface approach, computed as a sum of numerous variables
  • Points = exam points of the statistics course

Show a graphical overview of the data and summaries of the variables in the data

summary(learning2014)
##        X          gender       Age           Attitude          Deep      
##  Min.   :  1.00   F:110   Min.   :17.00   Min.   :14.00   Min.   :19.00  
##  1st Qu.: 42.25   M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:40.00  
##  Median : 83.50           Median :22.00   Median :32.00   Median :44.00  
##  Mean   : 83.50           Mean   :25.51   Mean   :31.43   Mean   :44.16  
##  3rd Qu.:124.75           3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:49.00  
##  Max.   :166.00           Max.   :55.00   Max.   :50.00   Max.   :59.00  
##       Stra            Surf           Points     
##  Min.   :10.00   Min.   :19.00   Min.   : 7.00  
##  1st Qu.:21.00   1st Qu.:29.00   1st Qu.:19.00  
##  Median :25.50   Median :34.00   Median :23.00  
##  Mean   :24.97   Mean   :33.45   Mean   :22.72  
##  3rd Qu.:29.00   3rd Qu.:38.00   3rd Qu.:27.75  
##  Max.   :40.00   Max.   :52.00   Max.   :33.00

There are 110 female and 56 male students. The median age is 22 years (range 17-55). Students scored a median of 44 points (range 19-59) in the deep approach (Deep), 25.5 (10-40) in the strategic approach (Stra) and 34 (19-52) in the surface approach (Surf). The median exam score (Points) was 23 (range 7-33).

pairs(learning2014[-1], col=learning2014$gender)

The data are visualised by gender (females as black dots and males as red dots, following the factor level order F, M) in a matrix of bivariate scatter plots.

The data can also be visualised using the “GGally” and “ggplot2” packages.

library("GGally")  
library("ggplot2")  
p <- ggpairs(learning2014[-1], mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))  
p

Now we can see that male students score higher on the attitude questions and female students on the strategic questions. We can also see that Points and Attitude have the highest correlation coefficient (cor: 0.43), and Surf and Deep the second strongest in absolute value (cor: 0.32).

Regression model

The exam points (“Points”) students achieve are modelled with a multiple linear regression model using Attitude, Stra and Surf as explanatory variables. The model describes the linear relationship between the explanatory variables and the dependent variable.

points_regression <- lm(Points ~ Attitude + Stra + Surf, data = learning2014)  
summary(points_regression)  
## 
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## Stra         0.10664    0.06770   1.575  0.11716    
## Surf        -0.04884    0.06678  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

In the first model with all three explanatory variables, Attitude explains Points with a coefficient estimate of 0.34 (p < 0.0001). The other variables, Stra and Surf, do not explain Points with statistical significance. Thus, Surf is removed first and the model is refitted.

points_regression <- lm(Points ~ Attitude + Stra, data = learning2014)  
summary(points_regression)    
## 
## Call:
## lm(formula = Points ~ Attitude + Stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## Stra         0.11421    0.06681   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Attitude is still the only statistically significant variable and, thus, Stra is also removed and the model is refitted with Attitude alone (coefficient 0.35, p < 0.0001).

points_regression <- lm(Points ~ Attitude, data = learning2014) 
summary(points_regression)      
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The adjusted R-squared is 0.19, meaning that in this model Attitude explains about 19% of the total variance of the dependent variable Points.
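As a side note, for a model with a single predictor the (unadjusted) R-squared is exactly the squared correlation between predictor and response, which makes a quick sanity check. A minimal sketch with made-up numbers (not the learning2014 data):

```r
# Toy illustration (synthetic data): for simple linear regression,
# R-squared equals the squared correlation between x and y.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)
fit <- lm(y ~ x)
r2 <- summary(fit)$r.squared
all.equal(r2, cor(x, y)^2)   # TRUE
```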

Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage

plot(points_regression, which = c(1,2,5))

Residuals of a linear regression model reflect the validity of the model: residuals should be normally distributed, they should not correlate with each other, they should have constant variance, and their size should not depend on the fitted values.
The QQ-plot describes the distribution of the residuals. Our model's residuals are approximately normally distributed, as the points fall on a straight increasing line.
The Residuals vs Fitted plot shows how the residuals relate to the fitted values. According to the plot, the residuals are reasonably evenly distributed on both sides of the zero line and, thus, their variance does not depend on the fitted values.
The Residuals vs Leverage plot helps identify observations with an unusually high impact on the model. According to the plot, all leverage values fall within [0, 0.05], so no observation has an unusual impact on the model.
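Leverage can also be inspected numerically with `hatvalues()`; for any linear model the leverages sum to the number of estimated coefficients, and unusually distant x-values stand out. A toy sketch on synthetic data:

```r
# Leverage (hat) values quantify each observation's potential influence.
# For any lm fit they sum to the number of estimated coefficients.
x <- c(1, 2, 3, 4, 5, 10)   # the last point lies far from the x mean
y <- c(1.2, 1.9, 3.1, 4.2, 4.8, 10.5)
fit <- lm(y ~ x)
h <- hatvalues(fit)
sum(h)         # 2 (intercept + slope)
which.max(h)   # observation 6, the outlying x-value
```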


Exercise 3 Data analysis

Read the dataset and analyze the structure of the dataset

setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
alc <- read.csv("alc.csv", header = TRUE)
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"   "edu"
dim(alc)
## [1] 382  37

The data consist of 382 observations and 37 variables. Their names are listed above; the variables describe alcohol usage and the social, gender and study-time attributes of each student.

Data Set Information:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

Attribute Information:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 nursery - attended nursery school (binary: yes or no)
13 internet - Internet access at home (binary: yes or no)
14 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
15 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
16 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
17 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
18 schoolsup - extra educational support (binary: yes or no)
19 famsup - family educational support (binary: yes or no)
20 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
21 activities - extra-curricular activities (binary: yes or no)
22 higher - wants to take higher education (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
34 alc_use - average of ‘Dalc’ and ‘Walc’
35 high_use - TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
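The two derived columns (34 and 35) can be reproduced along these lines; below is a minimal sketch on a made-up data frame, not the course's actual data-wrangling script:

```r
# Sketch of how alc_use and high_use are derived (toy data frame;
# the real columns come from the data-wrangling step of the course).
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2   # average of workday and weekend use
toy$high_use <- toy$alc_use > 2             # TRUE when the average exceeds 2
toy$alc_use    # 1.0 3.5 5.0
toy$high_use   # FALSE TRUE TRUE
```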

Personal hypothesis about the data

A high number of ‘absences’, lack of ‘famsup’, low ‘studytime’ and bad ‘famrel’ correlate with higher alcohol consumption.

Numerically and graphically explore the distributions of your variables

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
g1 <- ggplot(alc, aes(x = alc_use, y = absences, fill = sex))
g1 + geom_bar(stat="identity") + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

g2 <- ggplot(alc, aes(x = alc_use, y = (famsup == "yes"), fill = sex))
g2 + geom_bar(stat="identity") + ylab("family educational support") + ggtitle("Student family educational support by alcohol consumption")

g3 <- ggplot(data = alc, aes(x = alc_use, y = studytime, fill=sex))
g3 + geom_bar(stat="identity") + ylab("weekly study time") + ggtitle("Student weekly study time by alcohol consumption and sex")

g4 <- ggplot(alc, aes(x = alc_use, y = famrel, fill = sex))
g4 + geom_bar(stat="identity") + ylab("quality of family relationships") + ggtitle("Student quality of family relationships by alcohol consumption and sex")

alc %>% group_by(famsup) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 2 × 3
##   famsup count mean_alc_use
##   <fctr> <int>        <dbl>
## 1     no   144     1.965278
## 2    yes   238     1.842437
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 9 × 3
##   alc_use count mean_absences
##     <dbl> <int>         <dbl>
## 1     1.0   140      3.357143
## 2     1.5    69      4.231884
## 3     2.0    59      3.915254
## 4     2.5    44      6.431818
## 5     3.0    32      6.093750
## 6     3.5    17      5.647059
## 7     4.0     9      6.000000
## 8     4.5     3     12.000000
## 9     5.0     9      6.888889
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_studytime = mean(studytime))
## # A tibble: 9 × 3
##   alc_use count mean_studytime
##     <dbl> <int>          <dbl>
## 1     1.0   140       2.307143
## 2     1.5    69       1.971014
## 3     2.0    59       1.983051
## 4     2.5    44       1.931818
## 5     3.0    32       1.718750
## 6     3.5    17       1.470588
## 7     4.0     9       1.777778
## 8     4.5     3       2.000000
## 9     5.0     9       1.666667
alc %>% group_by(famrel) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 5 × 3
##   famrel count mean_alc_use
##    <int> <int>        <dbl>
## 1      1     8     2.250000
## 2      2    19     2.184211
## 3      3    64     2.000000
## 4      4   189     1.880952
## 5      5   102     1.750000

Lack of family educational support and bad quality of family relationships seem to correlate with higher alcohol consumption. High alcohol consumption also correlates with a high number of absences from school. In addition, students with high alcohol consumption (alc_use > 3.0) study less than students with low alcohol consumption (alc_use < 2.5). These findings are concordant with the initial hypotheses.

Logistic regression

alc$famrel <- factor(alc$famrel)
alc$studytime <- factor(alc$studytime)
m <- glm(high_use ~ absences + famsup + alc$studytime + alc$famrel, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + famsup + alc$studytime + 
##     alc$famrel, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1803  -0.8226  -0.6503   1.1522   2.2409  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.015455   0.878444  -1.156 0.247693    
## absences        0.076934   0.022630   3.400 0.000675 ***
## famsupyes      -0.061188   0.245702  -0.249 0.803337    
## alc$studytime2 -0.439698   0.267726  -1.642 0.100520    
## alc$studytime3 -1.342617   0.442054  -3.037 0.002388 ** 
## alc$studytime4 -1.279644   0.588173  -2.176 0.029583 *  
## alc$famrel2     0.936030   0.964897   0.970 0.332005    
## alc$famrel3     0.643055   0.883261   0.728 0.466585    
## alc$famrel4     0.272456   0.858654   0.317 0.751012    
## alc$famrel5    -0.006842   0.876748  -0.008 0.993773    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 427.24  on 372  degrees of freedom
## AIC: 447.24
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                       OR      2.5 %     97.5 %
## (Intercept)    0.3622376 0.04877998  1.8120704
## absences       1.0799706 1.03533984  1.1316272
## famsupyes      0.9406468 0.58232256  1.5284716
## alc$studytime2 0.6442312 0.38110876  1.0908572
## alc$studytime3 0.2611613 0.10353967  0.5966021
## alc$studytime4 0.2781364 0.07612213  0.8066302
## alc$famrel2    2.5498396 0.42150543 21.4386194
## alc$famrel3    1.9022832 0.37564722 14.2026417
## alc$famrel4    1.3131852 0.27385709  9.4789868
## alc$famrel5    0.9931812 0.19882276  7.3471751

According to the logistic regression model, a higher number of absences correlates with higher alcohol consumption (OR 1.08 per absence), and studying more than 5 hours a week correlates with lower consumption. Lack of family support and the quality of family relationships do not correlate significantly with alcohol consumption in this model.
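The odds ratios above are simply the exponentiated coefficients. For a single binary predictor this matches the classic 2×2 cross-product ratio, which can be checked on synthetic counts (made-up numbers, not the alcohol data):

```r
# Toy check: for a logistic model with one binary predictor,
# exp(coefficient) equals the 2x2 table odds ratio (a*d)/(b*c).
d <- data.frame(
  exposed = rep(c(0, 1), times = c(100, 100)),
  case    = c(rep(c(0, 1), times = c(80, 20)),   # unexposed: odds 20/80
              rep(c(0, 1), times = c(50, 50)))   # exposed:   odds 50/50
)
fit <- glm(case ~ exposed, data = d, family = "binomial")
or_glm   <- exp(coef(fit)["exposed"])
or_table <- (50 * 80) / (50 * 20)   # = 4
all.equal(unname(or_glm), or_table, tolerance = 1e-6)   # TRUE
```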

Explore the predictive power

probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   253   15
##    TRUE     94   20
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
##         prediction
## high_use      FALSE       TRUE
##    FALSE 0.66230366 0.03926702
##    TRUE  0.24607330 0.05235602
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.24607330 0.05235602 0.29842932
##    Sum   0.90837696 0.09162304 1.00000000
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2853403

According to the results, 29% of the predictions were wrong. This is clearly better than random guessing (50% of predictions wrong), although not optimal: a trivial classifier that always predicts FALSE would already achieve a similar error rate of about 30%, since roughly 30% of the students are high users.
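With logical (TRUE/FALSE) classes and a 0.5 threshold, the loss function above is exactly the misclassification rate, as a small made-up example shows:

```r
# With logical classes, |class - prob| > 0.5 flags exactly the
# misclassified observations, so loss_func is the error rate.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
class <- c(TRUE, TRUE, FALSE, FALSE)
prob  <- c(0.9,  0.2,  0.1,   0.7)   # 2 of 4 are on the wrong side of 0.5
loss_func(class, prob)           # 0.5
mean(class != (prob > 0.5))      # same: 0.5
```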

Bonus - Perform 10-fold cross-validation on your model

library(boot)
#cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
#cv
#cv$delta[1]

The test-set performance is slightly worse, as the model gets about 31% of its predictions wrong.
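Since the cv.glm() call is commented out above, here is what K-fold cross-validation does, sketched by hand on synthetic data (toy variables; the real run uses the alcohol model with K = 10):

```r
# Minimal K-fold cross-validation sketch on synthetic data: fit on
# K-1 folds, measure misclassification on the held-out fold, average.
set.seed(1)
d <- data.frame(x = rnorm(60))
d$y <- runif(60) < plogis(d$x)   # toy binary outcome related to x
K <- 5
folds <- sample(rep(1:K, length.out = nrow(d)))
errs <- sapply(1:K, function(k) {
  fit <- glm(y ~ x, data = d[folds != k, ], family = "binomial")
  p <- predict(fit, newdata = d[folds == k, ], type = "response")
  mean(d$y[folds == k] != (p > 0.5))   # test-fold error rate
})
mean(errs)   # average test error over the K folds
```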


Exercise 4 Data analysis

Load the data

#setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

Explore the dataset

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston data frame has 506 rows and 14 columns. The data frame contains the following variables: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat and medv.

Visualize the correlation matrix

pairs(Boston)

As the plot matrix is hard to read, let's improve the visualisation with the corrplot package.

library(dplyr)
cor_matrix<-cor(Boston) 
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            ptratio       black      lstat       medv
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599
## age      0.2615150 -0.27353398  0.6023385 -0.3769546
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867
## black   -0.1773833  1.00000000 -0.3660869  0.3334608
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000
cor_matrix %>% round(2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
#install.packages("corrplot")
library(corrplot)
corrplot(cor_matrix %>% round(2), method="circle")

corrplot(cor_matrix %>% round(2), method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

There seem to be strong negative correlations between age (proportion of owner-occupied units built prior to 1940) and dis (weighted mean of distances to five Boston employment centres), between nox (nitrogen oxides concentration) and dis, between indus (proportion of non-retail business acres per town) and dis, and between lstat (lower status of the population) and medv (median value of owner-occupied homes in $1000s). On the other hand, there are strong positive correlations between indus and nox, indus and tax (full-value property-tax rate per $10,000), nox and age, nox and tax, rm (average number of rooms per dwelling) and medv, and rad (index of accessibility to radial highways) and tax.

By comparing the medians and means, and their position relative to the minimum and maximum values, it seems that the variables crim, zn, age, rad, tax, indus and black are not normally distributed, whereas the others may be approximately normal. However, to draw such conclusions one should first run a normality test (for instance Shapiro-Wilk or Kolmogorov-Smirnov).
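Such a normality check is one line in base R; a sketch on a deliberately skewed toy vector (the same call would be applied to a Boston column):

```r
# Quick normality check (Shapiro-Wilk from base R; the Kolmogorov-
# Smirnov test via ks.test() is a common alternative).
skewed <- (1:50)^3          # heavily right-skewed toy vector
shapiro.test(skewed)$p.value   # very small: normality is rejected
```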

Standardize the dataset

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Scaling subtracts the column mean from each value and divides by the column standard deviation. Thus, after scaling, every variable has a mean of 0 and a standard deviation of 1, which makes the distributions easier to compare. The closer the median is to 0, the more symmetric the distribution. Unlike, for example, a logarithmic transformation, scaling is linear and does not change the shape of the distribution.
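The identity behind scale() can be verified on a toy vector:

```r
# scale() subtracts the column mean and divides by the column sd,
# so every scaled variable has mean 0 and standard deviation 1.
x <- c(2, 4, 6, 8, 10)
s <- as.vector(scale(x))
all.equal(s, (x - mean(x)) / sd(x))  # TRUE
mean(s)   # 0
sd(s)     # 1 (up to floating point)
```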

Change the object to data frame and create a categorical variable of the crime rate

class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000

Create a quantile vector of crim

bins <- quantile(scaled_crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

Create a categorical variable ‘crime’

crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

#remove original crim from the dataset
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
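The cut-by-quantiles trick generalises: with quantile breaks, cut() yields roughly equal-sized categories, as a toy vector illustrates:

```r
# cut() with quantile breaks gives roughly equal-sized categories
# (include.lowest keeps the minimum value in the first bin).
v <- 1:100
b <- quantile(v)
cats <- cut(v, breaks = b, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)   # 25 observations per category
```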

Choose randomly 80% of the rows

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)

Create train and test sets

train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
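The split logic in miniature (note that the chunk above samples without a fixed seed, so its split is not reproducible; set.seed() below is an addition for illustration):

```r
# Sketch of an 80/20 train/test split on a toy data frame.
set.seed(123)
df <- data.frame(id = 1:10)
idx <- sample(nrow(df), size = nrow(df) * 0.8)
tr <- df[idx, , drop = FALSE]    # 8 rows
te <- df[-idx, , drop = FALSE]   # the remaining 2 rows
nrow(tr)   # 8
nrow(te)   # 2
```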

Linear discriminant analysis

# lda() comes from the MASS package, which was loaded above
lda.fit <- lda(crime~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2698020 0.2450495 0.2277228 0.2574257 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       0.95681457 -0.9352889 -0.127848325 -0.8827690  0.4386072
## med_low  -0.08315566 -0.2430318  0.006051757 -0.5048785 -0.1512355
## med_high -0.38377579  0.1394693  0.198411178  0.3043928  0.1748530
## high     -0.48724019  1.0170690 -0.007331936  1.0781191 -0.3926115
##                 age        dis        rad        tax    ptratio      black
## low      -0.8927844  0.8879153 -0.6952813 -0.7317611 -0.4820477  0.3786823
## med_low  -0.2429419  0.2552198 -0.5572864 -0.4423375 -0.0816398  0.3490755
## med_high  0.4277372 -0.3446267 -0.4251144 -0.3287766 -0.2355170  0.1811425
## high      0.7991204 -0.8367862  1.6386213  1.5144083  0.7813507 -0.7125453
##                lstat        medv
## low      -0.78074623  0.51135373
## med_low  -0.08460568 -0.01894297
## med_high -0.07367346  0.24107520
## high      0.91891561 -0.72411156
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.116441085  0.69779190 -0.75458733
## indus    0.038513302 -0.31620459  0.39101804
## chas    -0.062071500 -0.05880809  0.11829047
## nox      0.437756382 -0.68614657 -1.54840350
## rm      -0.149262567 -0.06059328 -0.15826392
## age      0.224347127 -0.46597807 -0.18525818
## dis      0.001093547 -0.34425723 -0.02177413
## rad      3.370162180  0.86904316 -0.06551355
## tax      0.003589327  0.09553562  0.55778282
## ptratio  0.102061530  0.02177559 -0.26053904
## black   -0.139812591  0.04299155  0.22243764
## lstat    0.196710187 -0.17094701  0.45603266
## medv     0.184517380 -0.40604132 -0.25638771
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9560 0.0335 0.0104

The function for lda biplot arrows

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

Target classes as numeric

classes <- as.numeric(train$crime)

Plot the lda results

plot(lda.fit, dimen = 2)

plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Save the correct classes from test data and remove the crime variable

correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Predict the classes with the LDA model on the test data and cross tabulate the results with the crime categories from the test set

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low        9       7        2    0
##   med_low    6      19        2    0
##   med_high   1      11       20    2
##   high       0       0        0   23

According to the results, the model predicts the true values of the categorical crime variable well. Most of the errors occur in the med_low group, where the prediction accuracy is lowest.
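One way to summarise the cross-tabulation is the overall prediction accuracy, i.e. the share of observations on the diagonal. A small sketch using the counts printed above (the exact numbers vary with the random train/test split):

```r
# confusion matrix from the test-set prediction above
tab <- matrix(c(9,  7,  2,  0,
                6, 19,  2,  0,
                1, 11, 20,  2,
                0,  0,  0, 23),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# proportion of correctly classified observations
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 71/102, about 0.70
```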

Calculate the distances between the observations and run k-means algorithm on the dataset

dist_eu <- dist(boston_scaled)
## Warning in dist(boston_scaled): NAs introduced by coercion
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1394  3.5270  4.9080  4.9390  6.2420 13.0000
dist_man <- dist(boston_scaled, method = "manhattan")
## Warning in dist(boston_scaled, method = "manhattan"): NAs introduced by
## coercion

K-means algorithm

Determine and visualize the optimal number of clusters by first assigning a max of 10 clusters and then calculating the total within sum of squares

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')

Based on the elbow in the total within-cluster sum of squares, the optimal number of clusters is 2.

K-means cluster analysis

km <-kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)

Bonus: Run k-means with 3 centers and perform LDA using the clusters as target classes

km <-kmeans(dist_eu, centers = 3)
lda.fit <- lda(km$cluster~., data = boston_scaled)
lda.fit
## Call:
## lda(km$cluster ~ ., data = boston_scaled)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.4426877 0.2075099 0.3498024 
## 
## Group means:
##           zn      indus        chas        nox         rm        age
## 1 -0.2309344 -0.4551440 -0.27232907 -0.4915087 -0.1311176 -0.2883074
## 2  1.3140079 -0.9068835  0.47759479 -0.7455383  1.0548777 -0.7434166
## 3 -0.4872402  1.1139831  0.06132349  1.0642908 -0.4598408  0.8058735
##          dis        rad        tax      ptratio      black      lstat
## 1  0.2864417 -0.5768315 -0.6147205 -0.006886361  0.3154274 -0.2702405
## 2  0.7952165 -0.5935800 -0.6372993 -0.976736455  0.3398644 -0.8793299
## 3 -0.8342410  1.0821251  1.1560103  0.588134874 -0.6007995  0.8636357
##          medv crimemed_low crimemed_high crimehigh
## 1 -0.02274038   0.43750000     0.2589286 0.0000000
## 2  1.16160305   0.16190476     0.2761905 0.0000000
## 3 -0.66030777   0.06214689     0.2203390 0.7175141
## 
## Coefficients of linear discriminants:
##                        LD1         LD2
## zn             0.172610013 -1.25618179
## indus          1.078970813 -0.12226466
## chas           0.004463932 -0.49433521
## nox            0.921052125 -0.16586799
## rm            -0.040950198 -0.49922411
## age           -0.067015888  0.07618846
## dis            0.138379223 -0.06104503
## rad            0.579745765  0.43033468
## tax            0.401343141 -0.66310399
## ptratio        0.374368029  0.11727097
## black         -0.061203934  0.05315043
## lstat          0.376322744 -0.56677965
## medv           0.109367758 -0.69236403
## crimemed_low  -0.389791316  0.15910228
## crimemed_high -0.486827228 -0.58792893
## crimehigh     -0.207144258 -1.08365331
## 
## Proportion of trace:
##    LD1    LD2 
## 0.7951 0.2049
plot(lda.fit, dimen = 2)

plot(lda.fit, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 1)

Using 3 clusters for the K-means analysis, the most influential linear separators are nox and chas. This means that if the data were divided into 3 classes, the variables nox and chas would explain most of the dimensional differences and would thus be the best variables for predicting possible new observations.

Super-Bonus: Create a matrix product, which is a projection of the data points

model_predictors <- dplyr::select(train, -crime)
###Check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 16  2
###Matrix multiplication
#matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
#matrix_product <- as.data.frame(matrix_product)
#install.packages("plotly")
#library("plotly")
###Create a 3D plot (Cool!) of the columns of the matrix product
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
#plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$cluster)

The same results as above are visualised in 3D, colouring the observations by the variable crime in the first plot and by the selected cluster assignment in the second. For some reason, the visualisation does not work after knitting the file.


Exercise 4 Data analysis

Load the data

setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
human2 <- read.csv("human2.csv")
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
colnames(human2)[1] <- "Country"
human2 <- human2[ ,2:9]
#human2 <- select(human2, -Country)
dim(human2)
## [1] 155   8
str(human2)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

After dropping the Country column, there are 8 variables and 155 observations.

The original data is from http://hdr.undp.org/en/content/human-development-index-hdi and has been retrieved, modified and analyzed by Tuomo Nieminen. The data combines several indicators from most countries in the world.
“Country” = Country name
Health and knowledge
“GNI” = Gross National Income per capita
“Life.Exp” = Life expectancy at birth
“Edu.Exp” = Expected years of schooling
“Mat.Mor” = Maternal mortality ratio
“Ado.Birth” = Adolescent birth rate
Empowerment
“Parli.F” = Percentage of female representatives in parliament
“Edu2.F” = Proportion of females with at least secondary education
“Edu2.M” = Proportion of males with at least secondary education
“Labo.F” = Proportion of females in the labour force
“Labo.M” = Proportion of males in the labour force
“Edu2.FM” = Edu2.F / Edu2.M
“Labo.FM” = Labo.F / Labo.M

Show an overview of the data

human2 <- mutate(human2, GNI=str_replace(human2$GNI, pattern=",", replace ="") %>% as.numeric())
pairs(human2)

summary(human2)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
library(GGally)
library(corrplot)
#ggpairs(data=human2)
cor(human2) %>% corrplot()

Variables “Edu2.FM”, “Labo.FM”, “Life.Exp”, “Edu.Exp” and “Parli.F” seem to be roughly normally distributed, as their means and medians are close to each other. On the other hand, “GNI”, “Mat.Mor” and “Ado.Birth” have clearly differing median and mean values, indicating skewed distributions.
The pairs “Life.Exp” and “Edu.Exp”, and “Mat.Mor” and “Ado.Birth” correlate positively. The pairs “Life.Exp” and “Mat.Mor”, “Edu.Exp” and “Mat.Mor”, “Life.Exp” and “Ado.Birth”, and “Edu.Exp” and “Ado.Birth” correlate negatively.

Perform principal component analysis (PCA) on the unstandardized human data

pca_human <- prcomp(human2)
biplot(pca_human, choices = 1:2, cex=c(0.8,1), col=c("grey40", "deeppink2"))

Standardize the variables in the human data and repeat the PCA analysis

human_scaled <- scale(human2)
pca_human <- prcomp(human_scaled)
s <- summary(pca_human)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
#Rounded percentages of variance captured by each PC
pca_pr <- round(1*s$importance[2, ], digits = 5)
#Print out the percentages of variance
round(pca_pr*100, digits=1)
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
pca_pr <- round(pca_pr*100, digits=1)
#Create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
paste0(names(pca_pr), " (", pca_pr, "%)")
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)"  "PC4 (7.6%)"  "PC5 (5.5%)" 
## [6] "PC6 (3.6%)"  "PC7 (2.6%)"  "PC8 (1.3%)"
biplot(pca_human, choices = 1:2, cex=c(0.8,1), col=c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

The results differ. The PCA biplot of the unscaled data suggests that a single variable (presumably GNI, which has by far the largest numerical range) explains nearly all of the variance and is aligned with PC1. After scaling, all variables are displayed clearly in the plot, each pointing in a different direction with almost equal magnitude. Scaling removes the differences in the scales of the variables and thus makes them comparable.

Adolescent birth rate and maternal mortality point in the direction of increasing PC1, while life expectancy, expected years of schooling and the proportion of females to males with at least secondary education point in the opposite direction. This is understandable, as the opposing arrows describe phenomena of developing and developed countries, respectively. Perpendicular to these variables are the percentage of female representatives in parliament and the female-to-male labour force ratio, which point in the direction of increasing PC2.

Give your personal interpretations

PC1 describes a dimension of health and knowledge. Variables aligned with PC1 increase either wellbeing or education, which correlate strongly with each other, as shown earlier in the correlation matrix. PC2 describes female empowerment and is independent of the themes of PC1. The correlation plot supports this notion, as variables aligned with PC2 correlate positively with each other but negatively with the variables aligned with PC1.

Load the tea dataset, explore the data and perform Multiple Correspondence Analysis (MCA)

Load the data

library(FactoMineR)
data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

Select a small dataset

tea_time <- data.frame(tea$Tea, tea$How, tea$how, tea$sugar, tea$where, tea$lunch)
summary(tea_time)
##       tea.Tea     tea.How                  tea.how       tea.sugar  
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                 tea.where       tea.lunch  
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ tea.Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ tea.How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ tea.how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ tea.sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ tea.where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ tea.lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...

Visualize the dataset

gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") +  geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped

Run MCA

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## tea.Tea            | 0.126 0.108 0.410 |
## tea.How            | 0.076 0.190 0.394 |
## tea.how            | 0.708 0.522 0.010 |
## tea.sugar          | 0.065 0.001 0.336 |
## tea.where          | 0.702 0.681 0.055 |
## tea.lunch          | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage="quali")

The variables are distributed on a biplot of the first two principal dimensions, which explain 15.2% and 14.2% of the total variance, respectively.
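The variance percentages can also be read directly from the `$eig` element of the MCA object; a sketch (column order as returned by FactoMineR):

```r
library(FactoMineR)

data(tea)
# same six variables as selected above (column names differ slightly)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

# eigenvalue, percentage of variance and cumulative percentage per dimension
head(mca$eig, 2)
```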


Final Assignment

Abstract

In this assignment, I will analyse how alcohol consumption and other phenomena affect school grades. For the analysis, I will use a linear regression model and an MCA plot. According to the linear regression model, students who study more and students without romantic relationships are more likely to have higher school grades. The MCA plot nicely validates the role of romantic relationships.

Research Questions

Using the youth alcohol consumption data downloaded from the UCI Machine Learning Repository, I am going to study the variance of different variables using MCA and the effect of different variables on grades using linear regression.

My hypothesis is that a high number of ‘absences’, no ‘famsup’, low ‘studytime’ and bad ‘famrel’ correlate with higher alcohol consumption.

Data Wrangling

The data wrangling script can be found [here](https://github.com/obruck/IODS-project/blob/master/data/alc_data%20wrangling.R).

The data files “mat” and “por” have been joined using common variables as identifiers, and duplicate answers have been omitted. Three new variables have been created: “alc_use” is the average of the workday and weekend alcohol consumption, “edu” combines the mother’s and father’s education, and “high_use” indicates students with high alcohol consumption.
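The derived variables can be sketched as follows (a toy example with hypothetical values; the actual wrangling script is linked above, and the parental-education sum is an assumption inferred from the data):

```r
library(dplyr)

# toy data with workday (Dalc) and weekend (Walc) consumption and parental education
toy <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 3, 5),
                  Medu = c(4, 1, 3), Fedu = c(4, 1, 2))

toy <- mutate(toy,
              alc_use  = (Dalc + Walc) / 2,  # average alcohol consumption
              high_use = alc_use > 2,        # TRUE for high consumers
              edu      = Medu + Fedu)        # combined parental education (assumed)
toy
```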

Read and Explore the Data

setwd("/Users/obruck/Desktop/Open data course/IODS-project/data/")
alc <- read.csv("alc.csv")
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"   "edu"
dim(alc)
## [1] 382  37
str(alc)
## 'data.frame':    382 obs. of  37 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ edu       : int  8 2 2 6 6 7 4 8 5 7 ...

The data consists of 382 observations and 37 variables. They describe students’ alcohol use together with social, demographic and study-related attributes.

Data Set Information:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

Attribute Information:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 nursery - attended nursery school (binary: yes or no)
13 internet - Internet access at home (binary: yes or no)
14 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
15 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
16 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
17 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
18 schoolsup - extra educational support (binary: yes or no)
19 famsup - family educational support (binary: yes or no)
20 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
21 activities - extra-curricular activities (binary: yes or no)
22 higher - wants to take higher education (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
34 alc_use - average of ‘Dalc’ and ‘Walc’
35 high_use - TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
36 edu - combined parental education (sum of ‘Medu’ and ‘Fedu’)

summary(alc)
##        X          school   sex          age        address famsize  
##  Min.   :  1.00   GP:342   F:198   Min.   :15.00   R: 81   GT3:278  
##  1st Qu.: 96.25   MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104  
##  Median :191.50                    Median :17.00                    
##  Mean   :191.50                    Mean   :16.59                    
##  3rd Qu.:286.75                    3rd Qu.:17.00                    
##  Max.   :382.00                    Max.   :22.00                    
##  Pstatus      Medu            Fedu             Mjob           Fjob    
##  A: 38   Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  T:344   1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##          Median :3.000   Median :3.000   other   :138   other   :211  
##          Mean   :2.806   Mean   :2.565   services: 96   services:107  
##          3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##          Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use            edu       
##  Mode :logical   Min.   :1.000  
##  FALSE:268       1st Qu.:4.000  
##  TRUE :114       Median :6.000  
##  NA's :0         Mean   :5.372  
##                  3rd Qu.:7.000  
##                  Max.   :8.000
library(dplyr)
library(corrplot)
# Keep only the numeric variables for the correlation plot
alc_num <- select(alc, age, Medu, Fedu, traveltime, studytime, failures, famrel,
                  freetime, goout, Dalc, Walc, health, absences, G1, G2, G3, alc_use)
corrplot(cor(alc_num), method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The data comprises 198 females and 184 males. Approximate normality is judged here on the basis of a sufficient sample size (n = 382) and on how close the median and mean of each variable are to each other. By these criteria, the variables “age”, “Medu”, “studytime”, “failures”, “famrel”, “freetime”, “goout”, “Dalc”, “Walc”, “G1”, “G2”, “G3” and “alc_use” are approximately normally distributed. In addition, the variables “school”, “sex”, “address”, “famsize”, “Pstatus”, “Mjob”, “Fjob”, “reason”, “nursery”, “internet”, “guardian”, “schoolsup”, “famsup”, “paid”, “activities”, “higher”, “romantic” and “high_use” are binomial or multinomial.

As correlation plots can only be computed from numeric variables, only these are included in the “alc_num” data frame. Their correlations are then assessed with a correlation plot. The education levels of students’ mothers and fathers correlate positively. The same holds for weekday and weekend alcohol consumption, as well as for the first-period, second-period and final grades. On the other hand, the number of past class failures correlates negatively with school grades.
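The strongest pairs can also be read off numerically from the correlation matrix; a small sketch, assuming the “alc_num” data frame built above:

```r
# Sketch: list the five strongest pairwise correlations in alc_num
cors <- cor(alc_num)
cors[lower.tri(cors, diag = TRUE)] <- NA      # keep each pair only once
long <- as.data.frame(as.table(cors))         # long format: Var1, Var2, Freq
long <- long[!is.na(long$Freq), ]
head(long[order(-abs(long$Freq)), ], 5)
```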
The distributions of the nominal variables are assessed next. As the bar plots show, the numbers of male and female students, of students with and without extracurricular activities, and of students with and without extra paid classes in the course subject are roughly equal. However, more students live in urban areas, have a family size of >3 members, do not receive family educational support, do not consume high amounts of alcohol, want to take higher education, are not in a romantic relationship, attend the Gabriel Pereira school and do not receive extra educational support.

library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
alc_nom <- data.frame(alc$school, alc$sex, alc$address, alc$famsize, alc$paid, alc$schoolsup, alc$famsup, alc$activities, alc$higher, alc$romantic, alc$high_use)
gather(alc_nom) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") +  geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
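The warning above comes from `gather()` coercing factors and logicals to a common type. `gather()` has since been superseded by `pivot_longer()`, which makes the coercion explicit; a sketch, assuming the “alc_nom” data frame above and tidyr >= 1.2:

```r
# Same faceted bar plot with pivot_longer(); values coerced to character explicitly
alc_nom %>%
  pivot_longer(everything(), names_to = "key", values_to = "value",
               values_transform = as.character) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```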

Then, we explore the relationship of the mothers’ education, the fathers’ education and their summed education (“edu”) to alcohol use and the final grade. Parental education does not seem to affect alcohol use, but it shows a clear positive association with mean grades.

alc %>% group_by(Medu) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 5 × 3
##    Medu count mean_alc_use
##   <int> <int>        <dbl>
## 1     0     3     2.166667
## 2     1    51     1.950980
## 3     2    98     1.739796
## 4     3    95     2.021053
## 5     4   135     1.874074
alc %>% group_by(Fedu) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 5 × 3
##    Fedu count mean_alc_use
##   <int> <int>        <dbl>
## 1     0     2     1.000000
## 2     1    77     1.941558
## 3     2   105     1.895238
## 4     3    99     1.787879
## 5     4    99     1.959596
alc %>% group_by(edu) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 8 × 3
##     edu count mean_alc_use
##   <int> <int>        <dbl>
## 1     1     2     1.000000
## 2     2    36     1.958333
## 3     3    39     1.884615
## 4     4    68     1.933824
## 5     5    41     1.768293
## 6     6    63     1.888889
## 7     7    58     1.775862
## 8     8    75     1.993333
alc %>% group_by(Medu) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 5 × 3
##    Medu count mean_grade
##   <int> <int>      <dbl>
## 1     0     3   12.66667
## 2     1    51   10.09804
## 3     2    98   10.79592
## 4     3    95   11.49474
## 5     4   135   12.40000
alc %>% group_by(Fedu) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 5 × 3
##    Fedu count mean_grade
##   <int> <int>      <dbl>
## 1     0     2   13.50000
## 2     1    77   10.07792
## 3     2   105   11.56190
## 4     3    99   11.72727
## 5     4    99   12.11111
alc %>% group_by(edu) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 8 × 3
##     edu count mean_grade
##   <int> <int>      <dbl>
## 1     1     2  13.500000
## 2     2    36   9.861111
## 3     3    39  10.307692
## 4     4    68  10.867647
## 5     5    41  11.560976
## 6     6    63  12.063492
## 7     7    58  12.120690
## 8     8    75  12.226667

Statistical Methods

Linear regression is a statistical method for predicting a continuous variable (the dependent variable) with one or more explanatory variables (independent variables). If only one explanatory variable is used, the model is called simple linear regression; otherwise it is multiple linear regression. Multiple linear regression estimates the effect of each independent variable while adjusting for the others, which matters when the independent variables are correlated with each other (multicollinearity). The fitted model can be written as y = b0 + b1x1 + … + bnxn + ε, where y is the response (dependent) variable, x1…xn the covariates, b1…bn their regression coefficients, b0 the intercept and ε the error term.
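The idea can be illustrated with simulated data, where `lm()` should recover the coefficients used to generate y; a self-contained sketch:

```r
# Simulate y = 2 + 1.5*x1 - 0.5*x2 + noise and recover the coefficients with lm()
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(200, sd = 0.3)
coef(lm(y ~ x1 + x2))   # estimates should be close to 2, 1.5 and -0.5
```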

Multiple correspondence analysis (MCA) is a visualisation method for analysing categorical data and the associations between different variables and their values. The data are projected onto a two-dimensional plot whose x- and y-axes represent the dimensions explaining most of the variance in the data. The categories of the selected variables are then placed on the plot according to their association with the principal dimensions and with each other. Unlike PCA or LDA, MCA compares the categories of categorical variables rather than the exact numeric values of each observation.
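A minimal MCA run on a built-in dataset shows the mechanics; a sketch using the `tea` survey data shipped with FactoMineR (the column names below are from that dataset):

```r
library(FactoMineR)
data(tea)                                          # survey of tea-drinking habits
res <- MCA(tea[, c("Tea", "How", "sugar", "where")], graph = FALSE)
res$eig[1:2, ]   # variance explained by the first two dimensions
```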

Results

Linear Regression Analysis

First, let’s run a linear regression analysis where the final grade is the response variable and gender, tendency to go out with friends, number of absences, alcohol consumption, family educational support, study time and romantic relationships are the independent variables.

alc$studytime <- factor(alc$studytime)
m <- lm(G3 ~ sex + goout + absences + alc_use + famsup + studytime + romantic, data = alc)
summary(m)
## 
## Call:
## lm(formula = G3 ~ sex + goout + absences + alc_use + famsup + 
##     studytime + romantic, data = alc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5079  -1.8658   0.2236   2.1942   7.2735 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.09832    0.68053  17.778  < 2e-16 ***
## sexM         0.52857    0.36391   1.452 0.147207    
## goout       -0.28159    0.16077  -1.751 0.080691 .  
## absences    -0.01752    0.03153  -0.556 0.578758    
## alc_use     -0.27792    0.19608  -1.417 0.157210    
## famsupyes   -0.07760    0.34674  -0.224 0.823046    
## studytime2   0.90464    0.41559   2.177 0.030127 *  
## studytime3   1.94795    0.57101   3.411 0.000717 ***
## studytime4   1.57705    0.72135   2.186 0.029421 *  
## romanticyes -0.77491    0.36009  -2.152 0.032042 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.218 on 372 degrees of freedom
## Multiple R-squared:  0.077,  Adjusted R-squared:  0.05467 
## F-statistic: 3.448 on 9 and 372 DF,  p-value: 0.0004245
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
cbind(OR, CI)
##                       OR        2.5 %       97.5 %
## (Intercept) 1.795706e+05 4.710642e+04 6.845268e+05
## sexM        1.696507e+00 8.294443e-01 3.469956e+00
## goout       7.545849e-01 5.500615e-01 1.035154e+00
## absences    9.826307e-01 9.235547e-01 1.045485e+00
## alc_use     7.573586e-01 5.150550e-01 1.113652e+00
## famsupyes   9.253378e-01 4.679389e-01 1.829833e+00
## studytime2  2.471031e+00 1.091375e+00 5.594774e+00
## studytime3  7.014305e+00 2.282198e+00 2.155838e+01
## studytime4  4.840670e+00 1.171881e+00 1.999528e+01
## romanticyes 4.607442e-01 2.269591e-01 9.353457e-01
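Note that exponentiated coefficients are interpretable as odds ratios only in logistic regression, where the response is binary; for the linear model above they are simply exp-transformed coefficients. A sketch of the logistic counterpart, predicting high_use from the same data (covariates chosen here for illustration):

```r
# Logistic regression: exp(coef) are genuine odds ratios here
m2 <- glm(high_use ~ sex + goout + absences + studytime, data = alc,
          family = "binomial")
exp(cbind(OR = coef(m2), confint(m2)))
```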

MCA

Then, we shall visually analyse the final school grade, gender, romantic relationships and extracurricular activities using an MCA plot.
Transform the G3 variable to categorical

# Recode the final grade into three categories (avoiding attach()/detach())
alc$G3cat[alc$G3 <= 10] <- "Low grade"
alc$G3cat[alc$G3 > 10 & alc$G3 <= 14] <- "Intermediate grade"
alc$G3cat[alc$G3 > 14] <- "High grade"
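The same recoding can be written in one step with `cut()`; note that `cut()` returns a factor rather than a character vector, so the result is shown in a separate variable here:

```r
# Equivalent three-way recoding of the final grade with cut()
G3cat_factor <- cut(alc$G3,
                    breaks = c(-Inf, 10, 14, Inf),
                    labels = c("Low grade", "Intermediate grade", "High grade"))
table(G3cat_factor)
```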
str(alc)
## 'data.frame':    382 obs. of  38 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : Factor w/ 4 levels "1","2","3","4": 2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ edu       : int  8 2 2 6 6 7 4 8 5 7 ...
##  $ G3cat     : chr  "Low grade" "Low grade" "Intermediate grade" "Intermediate grade" ...

Run MCA

library(FactoMineR)
alc_mca <- data.frame(alc$G3cat, alc$romantic, alc$high_use, alc$sex, alc$activities)
mca <- MCA(alc_mca, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = alc_mca, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.253   0.226   0.208   0.195   0.173   0.146
## % of var.             21.067  18.812  17.344  16.229  14.386  12.163
## Cumulative % of var.  21.067  39.878  57.222  73.451  87.837 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.192  0.038  0.040 |  0.707  0.580  0.544 | -0.457
## 2                  | -0.192  0.038  0.040 |  0.707  0.580  0.544 | -0.457
## 3                  |  0.308  0.098  0.078 |  0.468  0.254  0.180 | -0.013
## 4                  | -0.601  0.374  0.320 | -0.087  0.009  0.007 |  0.853
## 5                  | -0.366  0.139  0.161 |  0.357  0.147  0.153 | -0.071
## 6                  |  0.103  0.011  0.013 | -0.679  0.535  0.563 |  0.171
## 7                  |  0.138  0.020  0.022 | -0.065  0.005  0.005 | -0.084
## 8                  | -0.192  0.038  0.040 |  0.707  0.580  0.544 | -0.457
## 9                  | -0.207  0.044  0.028 | -0.340  0.134  0.075 | -0.878
## 10                 |  0.103  0.011  0.013 | -0.679  0.535  0.563 |  0.171
##                       ctr   cos2  
## 1                   0.263  0.227 |
## 2                   0.263  0.227 |
## 3                   0.000  0.000 |
## 4                   0.915  0.644 |
## 5                   0.006  0.006 |
## 6                   0.037  0.036 |
## 7                   0.009  0.008 |
## 8                   0.263  0.227 |
## 9                   0.970  0.501 |
## 10                  0.037  0.036 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## High grade         |  -0.876  10.799   0.166  -7.954 |  -0.848  11.348
## Intermediate grade |  -0.010   0.003   0.000  -0.170 |  -0.196   1.520
## Low grade          |   0.428   5.425   0.110   6.462 |   0.637  13.478
## alc.romantic_no    |   0.159   1.361   0.054   4.549 |  -0.129   1.000
## alc.romantic_yes   |  -0.342   2.936   0.054  -4.549 |   0.277   2.157
## FALSE              |  -0.506  14.201   0.602 -15.138 |  -0.079   0.385
## TRUE               |   1.189  33.386   0.602  15.138 |   0.185   0.906
## F                  |  -0.611  15.285   0.401 -12.362 |   0.483  10.700
## M                  |   0.657  16.448   0.401  12.362 |  -0.519  11.514
## alc.activities_no  |   0.047   0.082   0.002   0.865 |   0.767  24.726
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## High grade           0.156  -7.705 |  -1.159  22.959   0.291 -10.523 |
## Intermediate grade   0.031  -3.440 |   0.653  18.323   0.345  11.468 |
## Low grade            0.243   9.625 |  -0.230   1.895   0.032  -3.466 |
## alc.romantic_no      0.036  -3.684 |  -0.483  15.303   0.503 -13.840 |
## alc.romantic_yes     0.036   3.684 |   1.041  33.010   0.503  13.840 |
## FALSE                0.015  -2.357 |  -0.040   0.105   0.004  -1.183 |
## TRUE                 0.015   2.357 |   0.093   0.247   0.004   1.183 |
## F                    0.251   9.774 |   0.015   0.011   0.000   0.303 |
## M                    0.251  -9.774 |  -0.016   0.012   0.000  -0.303 |
## alc.activities_no    0.530  14.215 |  -0.307   4.280   0.085  -5.679 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## alc.G3cat          | 0.205 0.297 0.449 |
## alc.romantic       | 0.054 0.036 0.503 |
## alc.high_use       | 0.602 0.015 0.004 |
## alc.sex            | 0.401 0.251 0.000 |
## alc.activities     | 0.002 0.530 0.085 |
plot(mca, invisible=c("ind"), habillage="quali")

Conclusions and Discussion

According to the linear regression model, students who study more and students without a romantic relationship tend to have higher school grades. Interestingly, the optimal study time is 5-10 hours/week (studytime = 3). The adjusted R-squared value is 0.055, so the model explains only about 5.5% of the variance in school grades. Other variables likely affect school grades as well, and the estimated effects of the included variables may be attenuated by multicollinearity.

In the MCA plot, we can see that the first two dimensions explain 21.1% and 18.8% of the total variance. High grades are located in the lower left quadrant (negative coordinates on both dimensions), intermediate grades near the origin (little association) and low grades in the upper right quadrant (positive coordinates on both dimensions). Extracurricular activities (“alc.activities_yes”), low alcohol consumption (“FALSE”) and the lack of a romantic relationship (“alc.romantic_no”) are associated with higher school grades. Gender (“M” and “F”), on the other hand, has no clear effect. The exact coordinates on the dimensions can be seen in the summary of the MCA analysis.

In this case, validating either method is quite challenging, as there is no additional validation cohort. Study time was initially also included in the MCA plot, but it cluttered the figure and was therefore left out. On the other hand, the MCA plot nicely corroborates the results of the linear regression model, emphasising the negative association of romantic relationships with school grades. Luckily, life is more than school grades :)